This report compares the results of the RNA-seq analysis reported in Brauns et al., 2022 (https://insight.jci.org/articles/view/154183) with the results I obtained by using and modifying the code provided by Prof. Gomez-Cabrero in class.
Getting the counts from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path <- paste(urld, "acc=GSE198256", "file=GSE198256_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&");
GSE198256_count <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames=1)
Getting the metadata
library(GEOquery)
gds <- getGEO("GSE198256")
## Found 1 file(s)
## GSE198256_series_matrix.txt.gz
Meta_GSE198256 <- pData(gds$GSE198256_series_matrix.txt.gz@phenoData)
Meta_GSE198256 <- Meta_GSE198256[,c("title","source_name_ch1","characteristics_ch1","characteristics_ch1.1","description","cell type:ch1","disease state:ch1")]
Factors_GSE198256 <- Meta_GSE198256[,c("disease state:ch1")]
Initial number of genes:
## [1] 39376
We then collect the biological information needed later to perform the filtering with NOISeq. For this, we first export the gene IDs and manually retrieve the relevant annotations from BioMart on the web. The same process could also be done with the biomaRt R package (I tried, since I have used it before, but I got an error, apparently because I was calling one of the attributes by an incorrect name; however, I did not find any discrepancy between my call and the attribute list, so I still need to figure out what happened).
# Write the gene ids to a file, for uploading on BioMart web.
write.table(rownames(GSE198256_count),"gene_names.entrez.txt",
col.names = FALSE,row.names = FALSE,quote=F)
# Import the information
annotgene <- read.csv("./Resources/mart_export (10).txt",sep="\t",header = T)
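For reference, the same annotation could in principle be retrieved programmatically with biomaRt instead of the web interface. A minimal, untested sketch, assuming the current Ensembl attribute names (getting one of these names wrong is likely where my earlier error came from):

```r
# Untested sketch: fetch GC content, coordinates, chromosome and biotype for our Entrez IDs
library(biomaRt)
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
annotgene_bm <- getBM(
  attributes = c("entrezgene_id", "percentage_gene_gc_content",
                 "start_position", "end_position",
                 "chromosome_name", "gene_biotype"),
  filters = "entrezgene_id",
  values  = rownames(GSE198256_count),
  mart    = mart)
```

This returns a data frame analogous to the manually exported mart_export file, which could then replace the read.csv step.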
Let’s check how many genes were annotated:
sum(rownames(GSE198256_count) %in% annotgene$Entrezgene)
## [1] 25898
And the original number of genes in our count matrix was
nrow(GSE198256_count)
## [1] 39376
As we can see, we obtain annotations for fewer genes than are contained in the original table. Since we only want to use NOISeq for quality control, it is fine to limit ourselves to the annotated genes; for subsequent analyses we will use all the genes. Meanwhile, let’s filter the genes.
# Filter the genes to keep only those annotated to canonical chromosomes
annotgene <- annotgene[annotgene$Chromosome %in% c(as.character(1:22) ,"X","Y"),]
The number of genes is reduced to:
sum(rownames(GSE198256_count) %in% annotgene$Entrezgene) #25872
## [1] 25872
Some genes can also have more than one annotation, so let’s remove the duplicates:
annotgene_filt <- annotgene[!duplicated(annotgene$Entrezgene),]
sum(rownames(GSE198256_count) %in% annotgene$Entrezgene)
## [1] 25872
sum(annotgene_filt$Entrezgene %in% rownames(GSE198256_count))
## [1] 25872
Then, after removing the duplicates, we can assign the gene IDs as the row names of our object.
## Overlap between annotation and genes
rownames(annotgene_filt) <- as.character(annotgene_filt$Entrezgene)
sum(as.character(rownames(annotgene_filt)) %in% rownames(GSE198256_count))
## [1] 25872
Now, we can work with the annotated genes
GSE198256_count_filt <- GSE198256_count[rownames(GSE198256_count) %in% rownames(annotgene_filt),]
GSE198256_count_exc <-GSE198256_count[!(rownames(GSE198256_count) %in% rownames(annotgene_filt)),]
annotgene_ord <- annotgene_filt[rownames(GSE198256_count_filt ),]
# Check that the sizes match
sum(rownames(annotgene_ord)==rownames(GSE198256_count_filt))
## [1] 25872
NOISeq is an R package for quality control of count data. It allows us to identify and control outliers and to detect whether our data have any bias. With NOISeq we can monitor the quality of our data, make better decisions to improve our analysis, and be aware of features of the data that could bias the interpretation of our results. For quality control and exploration, NOISeq requires as input the counts, a factor (the variable of interest in our analysis/comparison) and some biological information, such as gene length, chromosome location, gene biotype and GC content.
## Loading required package: splines
## Loading required package: Matrix
We are going to use the annotations collected from BioMart to extract the additional biological information required by NOISeq; all of it is stored in annotgene_ord:
## GC start end Chromosome type Entrezgene
## 100287102 57.50 11869 14409 1 lncRNA 100287102
## 102466751 61.76 17369 17436 1 miRNA 102466751
## 100302278 31.16 30366 30503 1 miRNA 100302278
Now, from the table above, let’s extract the biological data in the format required by NOISeq: a named vector pairing gene ID and value.
#Adding additional biological data
lengthuse <- abs(annotgene_ord$end-annotgene_ord$start)
names(lengthuse) <- rownames(annotgene_ord)
gc <- annotgene_ord$GC
names(gc) <- rownames(annotgene_ord)
biotype <-annotgene_ord$type
names(biotype) <- rownames(annotgene_ord)
chromosome <- annotgene_ord[,c("Chromosome","start","end")]
Now, let’s define the factors:
Factors_GSE198256 <- data.frame(Meta_GSE198256 [ colnames(GSE198256_count_filt),c("disease state:ch1")])
colnames(Factors_GSE198256)[1]<- "Group"
Now that we have all the necessary information, let’s create the NOISeq object.
data_NOISEQ <- readData(data = GSE198256_count_filt,
length=lengthuse,
gc=gc,
biotype= biotype,
chromosome = annotgene_ord[,c("Chromosome","start","end")],
factors = Factors_GSE198256)
Next, we can generate some exploratory plots to evaluate the quality of our data (not every available plot, just the most representative ones).
Biotype detection: plots the abundance of the different biotypes (e.g. protein coding, lncRNA, etc.) in the genome, together with the percentage of genes of each biotype detected in the sample/condition and their proportion within the sample/condition.
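The biotype-detection output below was generated with NOISeq’s dat()/explo.plot() pair (the code chunk is not echoed in this report); a sketch of the calls, with the sample index chosen arbitrarily:

```r
# Biotype detection: genome-wide vs per-sample biotype abundance
mybiodetection <- dat(data_NOISEQ, k = 0, type = "biodetection", factor = NULL)
explo.plot(mybiodetection, samples = 1, plottype = "persample")
```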
## Biotypes detection is to be computed for:
## [1] "GSM5942339" "GSM5942340" "GSM5942341" "GSM5942342" "GSM5942343"
## [6] "GSM5942344" "GSM5942345" "GSM5942346" "GSM5942347" "GSM5942348"
## [11] "GSM5942349" "GSM5942350" "GSM5942351" "GSM5942352" "GSM5942353"
## [16] "GSM5942354" "GSM5942355" "GSM5942356" "GSM5942357" "GSM5942358"
## [21] "GSM5942359" "GSM5942360" "GSM5942361" "GSM5942362" "GSM5942363"
## [26] "GSM5942364" "GSM5942365" "GSM5942366" "GSM5942367" "GSM5942368"
## [31] "GSM5942369" "GSM5942370" "GSM5942371" "GSM5942372"
Saturation: represents the number of detected genes (counts > 0) per sample across different sequencing depths. It is also possible to generate the saturation plot for specific biotypes, as in the second example below; in that case it represents the number of detected genes of the given biotype across sequencing depths.
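A sketch of the NOISeq calls behind the saturation plots (the biotype label is assumed to match the BioMart annotation, e.g. "protein_coding"):

```r
# Saturation: detected genes per sample at simulated sequencing depths
mysaturation <- dat(data_NOISEQ, k = 0, ndepth = 7, type = "saturation")
explo.plot(mysaturation, toplot = 1, samples = 1:4)
# Saturation restricted to a specific biotype
explo.plot(mysaturation, toplot = "protein_coding", samples = 1:4)
```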
Length bias: mean gene expression per length bin. A fitted curve and a diagnostic test are also produced.
## [1] "Warning: 4402 features with 0 counts in all samples are to be removed for this analysis."
## [1] "Length bias detection information is to be computed for:"
## [1] "Covid19: Acute infection" "Covid19: Recovery 3Mo"
## [3] "Covid19: Recovery 6Mo" "Healthy"
## [1] "Covid19: Acute infection"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.911 -4.683 -0.612 2.443 76.882
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.856 10.112 0.678 0.49938
## bx1 5.307 10.404 0.510 0.61112
## bx2 18.797 11.096 1.694 0.09352 .
## bx3 31.606 12.199 2.591 0.01108 *
## bx4 34.033 12.907 2.637 0.00978 **
## bx5 44.134 14.821 2.978 0.00368 **
## bx6 26.317 18.781 1.401 0.16440
## bx7 24.610 28.088 0.876 0.38315
## bx8 36.512 54.607 0.669 0.50535
## bx9 -21.023 111.549 -0.188 0.85092
## bx10 39.474 95.600 0.413 0.68061
## bx11 14.260 16.692 0.854 0.39508
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.11 on 95 degrees of freedom
## Multiple R-squared: 0.5433, Adjusted R-squared: 0.4904
## F-statistic: 10.28 on 11 and 95 DF, p-value: 4.066e-12
##
## [1] "Covid19: Recovery 3Mo"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.592 -4.774 -0.605 2.684 119.975
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.926 13.907 0.570 0.5701
## bx1 6.998 14.309 0.489 0.6259
## bx2 11.385 15.261 0.746 0.4575
## bx3 26.902 16.778 1.603 0.1122
## bx4 26.663 17.752 1.502 0.1364
## bx5 43.826 20.385 2.150 0.0341 *
## bx6 19.515 25.831 0.755 0.4518
## bx7 27.683 38.632 0.717 0.4754
## bx8 23.538 75.105 0.313 0.7547
## bx9 2.597 153.423 0.017 0.9865
## bx10 24.305 131.487 0.185 0.8537
## bx11 18.537 22.957 0.807 0.4214
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.91 on 95 degrees of freedom
## Multiple R-squared: 0.2976, Adjusted R-squared: 0.2163
## F-statistic: 3.66 on 11 and 95 DF, p-value: 0.0002395
##
## [1] "Covid19: Recovery 6Mo"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.922 -4.884 -0.487 2.432 73.086
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.139 10.008 0.813 0.41807
## bx1 5.125 10.296 0.498 0.61981
## bx2 18.789 10.981 1.711 0.09034 .
## bx3 25.179 12.073 2.085 0.03971 *
## bx4 30.032 12.774 2.351 0.02079 *
## bx5 40.498 14.668 2.761 0.00692 **
## bx6 18.847 18.588 1.014 0.31318
## bx7 25.719 27.799 0.925 0.35722
## bx8 27.586 54.044 0.510 0.61093
## bx9 -6.744 110.400 -0.061 0.95142
## bx10 29.213 94.615 0.309 0.75819
## bx11 15.187 16.520 0.919 0.36025
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 95 degrees of freedom
## Multiple R-squared: 0.4577, Adjusted R-squared: 0.395
## F-statistic: 7.29 on 11 and 95 DF, p-value: 6.795e-09
##
## [1] "Healthy"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.506 -5.049 -0.397 2.055 122.350
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.668 14.151 0.613 0.5417
## bx1 6.170 14.559 0.424 0.6727
## bx2 11.731 15.528 0.755 0.4518
## bx3 29.188 17.072 1.710 0.0906 .
## bx4 28.229 18.063 1.563 0.1214
## bx5 47.408 20.742 2.286 0.0245 *
## bx6 17.702 26.284 0.673 0.5023
## bx7 28.622 39.308 0.728 0.4683
## bx8 30.383 76.420 0.398 0.6918
## bx9 -11.080 156.109 -0.071 0.9436
## bx10 34.762 133.789 0.260 0.7956
## bx11 17.101 23.359 0.732 0.4659
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.15 on 95 degrees of freedom
## Multiple R-squared: 0.3359, Adjusted R-squared: 0.259
## F-statistic: 4.368 on 11 and 95 DF, p-value: 2.773e-05
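The length-bias diagnostics above (one fitted model per group) come from NOISeq’s lengthbias analysis; a sketch of the calls that produce them, assuming the factor column is named "Group" as defined earlier:

```r
# Length bias: mean expression per length bin, with a fitted curve per group
mylengthbias <- dat(data_NOISEQ, factor = "Group", type = "lengthbias")
explo.plot(mylengthbias, samples = NULL, toplot = "global")
```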
GC bias: mean gene expression per GC-content bin. A fitted curve and a diagnostic test are also produced.
## [1] "Warning: 4402 features with 0 counts in all samples are to be removed for this analysis."
## [1] "GC content bias detection is to be computed for:"
## [1] "Covid19: Acute infection" "Covid19: Recovery 3Mo"
## [3] "Covid19: Recovery 6Mo" "Healthy"
## [1] "Covid19: Acute infection"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9344 -2.5596 -0.4698 1.9421 19.6481
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.904 5.181 0.174 0.86186
## bx1 3.783 7.327 0.516 0.60691
## bx2 3046.020 2687.824 1.133 0.26002
## bx3 -69.232 35.113 -1.972 0.05162 .
## bx4 53.828 8.491 6.340 8.19e-09 ***
## bx5 31.939 6.215 5.139 1.52e-06 ***
## bx6 34.478 5.953 5.792 9.41e-08 ***
## bx7 18.588 6.018 3.089 0.00265 **
## bx8 19.065 6.101 3.125 0.00237 **
## bx9 15.034 6.378 2.357 0.02051 *
## bx10 8.934 7.150 1.250 0.21461
## bx11 15.093 9.553 1.580 0.11751
## bx12 -11.281 32.776 -0.344 0.73148
## bx13 27.900 211.296 0.132 0.89524
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.181 on 93 degrees of freedom
## Multiple R-squared: 0.8153, Adjusted R-squared: 0.7895
## F-statistic: 31.58 on 13 and 93 DF, p-value: < 2.2e-16
##
## [1] "Covid19: Recovery 3Mo"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9195 -2.9800 -0.9105 2.2264 17.4665
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8506 5.4641 0.156 0.876624
## bx1 2.9233 7.7274 0.378 0.706065
## bx2 2834.6358 2834.7400 1.000 0.319923
## bx3 -80.6933 37.0322 -2.179 0.031858 *
## bx4 49.5330 8.9550 5.531 2.90e-07 ***
## bx5 25.3094 6.5544 3.861 0.000208 ***
## bx6 31.3267 6.2787 4.989 2.81e-06 ***
## bx7 18.5210 6.3472 2.918 0.004417 **
## bx8 18.1961 6.4341 2.828 0.005735 **
## bx9 16.1737 6.7266 2.404 0.018179 *
## bx10 10.1207 7.5409 1.342 0.182826
## bx11 14.6263 10.0746 1.452 0.149923
## bx12 -5.9216 34.5674 -0.171 0.864356
## bx13 -0.7437 222.8450 -0.003 0.997344
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.464 on 93 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6623
## F-statistic: 16.99 on 13 and 93 DF, p-value: < 2.2e-16
##
## [1] "Covid19: Recovery 6Mo"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6239 -2.9235 -0.7266 2.3116 22.8849
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.194 5.369 0.222 0.824496
## bx1 2.940 7.593 0.387 0.699447
## bx2 2280.998 2785.270 0.819 0.414908
## bx3 -64.502 36.386 -1.773 0.079551 .
## bx4 47.850 8.799 5.438 4.32e-07 ***
## bx5 25.740 6.440 3.997 0.000128 ***
## bx6 32.896 6.169 5.332 6.77e-07 ***
## bx7 19.862 6.236 3.185 0.001971 **
## bx8 20.614 6.322 3.261 0.001553 **
## bx9 17.647 6.609 2.670 0.008952 **
## bx10 12.582 7.409 1.698 0.092837 .
## bx11 17.830 9.899 1.801 0.074915 .
## bx12 -7.822 33.964 -0.230 0.818354
## bx13 -3.508 218.956 -0.016 0.987251
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.369 on 93 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6378
## F-statistic: 15.36 on 13 and 93 DF, p-value: < 2.2e-16
##
## [1] "Healthy"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.078 -3.014 -0.553 2.167 21.200
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8403 5.2538 0.160 0.87328
## bx1 3.6323 7.4300 0.489 0.62608
## bx2 3043.7543 2725.6463 1.117 0.26700
## bx3 -74.4242 35.6070 -2.090 0.03933 *
## bx4 54.1499 8.6104 6.289 1.03e-08 ***
## bx5 28.4345 6.3022 4.512 1.88e-05 ***
## bx6 32.8538 6.0370 5.442 4.25e-07 ***
## bx7 18.7056 6.1029 3.065 0.00285 **
## bx8 18.0965 6.1865 2.925 0.00433 **
## bx9 15.1394 6.4677 2.341 0.02138 *
## bx10 9.3411 7.2507 1.288 0.20084
## bx11 14.8243 9.6869 1.530 0.12933
## bx12 -8.8117 33.2371 -0.265 0.79151
## bx13 12.1562 214.2689 0.057 0.95488
## bx14 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.254 on 93 degrees of freedom
## Multiple R-squared: 0.783, Adjusted R-squared: 0.7527
## F-statistic: 25.82 on 13 and 93 DF, p-value: < 2.2e-16
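Likewise, the GC-bias diagnostics above can be reproduced with the analogous NOISeq calls (again assuming the factor column is named "Group"):

```r
# GC bias: mean expression per GC-content bin, with a fitted curve per group
myGCbias <- dat(data_NOISEQ, factor = "Group", type = "GCbias")
explo.plot(myGCbias, samples = NULL, toplot = "global")
```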
Exploratory PCA: principal component analysis score plot for PC1 vs PC2.
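The exploratory PCA can be drawn directly from the NOISeq object; a sketch:

```r
# PCA score plot, colored by the factor of interest
myPCA <- dat(data_NOISEQ, type = "PCA")
explo.plot(myPCA, factor = "Group")
```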
An essential step in an RNA-seq study is normalization, in which raw data are adjusted to account for factors that prevent the direct comparison of expression measures. Errors in normalization can have a significant impact on downstream analysis, such as inflated false positives in differential expression analysis (Evans et al., 2018). There are multiple normalization methods; some of them are:
RPKM: a normalization by library size, which aims to remove differences in sequencing depth simply by dividing by the total number of reads in each sample; the RPKM method adds a component to account for gene length as well. FPKM and ERPKM are variants of this method. After dividing by library size, the normalized counts reflect the proportion of total mRNA per cell taken up by each gene; if the total mRNA per cell is the same across conditions, this proportion reflects the absolute mRNA per cell for each gene (Evans et al., 2018). In R we can compute it using NOISeq as:
myRPKM = rpkm(assayData(data_NOISEQ)$exprs, long = lengthuse, k = 0, lc = 1)
UQUA: upper-quartile normalization is a method of normalization by distribution; it divides each read count by the 75th percentile of the read counts in its sample (Bullard et al., 2010). In R we can do it with NOISeq as:
myUQUA = uqua(assayData(data_NOISEQ)$exprs, long = lengthuse, lc = 0.5, k = 0)
TMM: the trimmed mean of M-values (TMM) is also a normalization by distribution. This approach chooses one sample as a reference, then calculates fold changes and absolute expression levels relative to that sample. Genes are trimmed twice, by these two values, to remove differentially expressed genes; the trimmed mean of the fold changes is then computed for each sample, and read counts are scaled by this trimmed mean and the total count of their sample. With NOISeq, TMM normalization can be applied as follows:
myTMM = tmm(assayData(data_NOISEQ)$exprs, long = 1000, lc = 0)
We start by creating the DESeq2 object for the analysis
library(DESeq2)
GSE198256_DESeq2 <- DESeqDataSetFromMatrix(countData = GSE198256_count_filt,
colData = pData(data_NOISEQ),
design = ~ Group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
Now we modify the labels to match the paper’s format and to avoid warnings due to unaccepted characters.
pDataUSE <- pData(data_NOISEQ)
pDataUSE[pDataUSE=="Healthy"] <- "Control"
pDataUSE[pDataUSE=="Covid19: Acute infection"] <- "Acute"
pDataUSE[pDataUSE=="Covid19: Recovery 3Mo"] <- "Early_recovery"
pDataUSE[pDataUSE=="Covid19: Recovery 6Mo"] <- "Late_recovery"
pDataUSE[,1] <- as.factor(pDataUSE[,1])
And let’s try again:
GSE198256_DESeq2 <- DESeqDataSetFromMatrix(countData = GSE198256_count_filt,
colData = pDataUSE,
design = ~ Group)
## it appears that the last variable in the design formula, 'Group',
## has a factor level, 'Control', which is not the reference level. we recommend
## to use factor(...,levels=...) or relevel() to set this as the reference level
## before proceeding. for more information, please see the 'Note on factor levels'
## in vignette('DESeq2').
In the paper, the authors mention that they performed the DE analysis by comparing the different groups of COVID-19 patients with the healthy controls. So let’s define the controls as the reference group for the contrasts using the relevel function.
GSE198256_DESeq2$Group <- relevel(GSE198256_DESeq2$Group, ref="Control")
With the current model we are including all genes from GSE198256_count_filt in our analysis, but we can remove genes with low expression to avoid noise and bias in our DEGs. To do so, we use the filtering criterion suggested by Prof. Gomez-Cabrero: retain genes with counts >= 10 in at least 6 samples (the size of the smallest group in our contrasts, which in this case corresponds to the Early recovery group, n=6).
smallestGroupSize <- 6
keep <- rowSums(counts(GSE198256_DESeq2) >= 10) >= smallestGroupSize
GSE198256_DESeq2_F <- GSE198256_DESeq2[keep,]
nrow(GSE198256_DESeq2_F)
## [1] 14055
Comparison with the paper: This is our first difference from Brauns et al., 2022, since their filtering criterion was to keep genes with counts >= 20 in at least one sample. This filtering is more relaxed than ours, and using it we end up with more genes, as we can see below:
#Trying the filtering criteria used in the paper
keep_paper <- rowSums(counts(GSE198256_DESeq2) >= 20) >= 1
GSE198256_DESeq2_Paper <- GSE198256_DESeq2[keep_paper,]
nrow(GSE198256_DESeq2_Paper)
## [1] 15573
Now, let’s go back to our initial filter and perform the DEG analysis.
GSE198256_DESeq2_F<- DESeq(GSE198256_DESeq2_F)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 13 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
GSE198256_res <- results(GSE198256_DESeq2_F)
# Check that we have the correct contrasts, always with control as the reference.
resultsNames(GSE198256_DESeq2_F)
## [1] "Intercept" "Group_Acute_vs_Control"
## [3] "Group_Early_recovery_vs_Control" "Group_Late_recovery_vs_Control"
Now, let’s explore the number of DEGs obtained in each contrast.
Initial DEGs Acute vs Control
summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Acute', 'Control')))
##
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1093, 7.8%
## LFC < 0 (down) : 1442, 10%
## outliers [1] : 7, 0.05%
## low counts [2] : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Initial DEGs Early recovery vs Control
summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Early_recovery', 'Control')))
##
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1184, 8.4%
## LFC < 0 (down) : 1246, 8.9%
## outliers [1] : 7, 0.05%
## low counts [2] : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Initial DEGs Late recovery vs Control
summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Late_recovery', 'Control')))
##
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 40, 0.28%
## LFC < 0 (down) : 98, 0.7%
## outliers [1] : 7, 0.05%
## low counts [2] : 11437, 81%
## (mean count < 780)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Now let’s explore our data a little with an MA plot:
plotMA(GSE198256_res, ylim=c(-2,2))
Interpretation: From the plot above we can see that at low A (x-axis) the M values (y-axis) are more variable; in other words, genes with few counts (low A) yield very variable log fold changes (M), most of which are therefore noise. As expected, the further we move to the right on the x-axis (more counts), the tighter the plot becomes, so the log fold change is less variable and the error is lower. To correct this problem we shrink the fold changes: the method looks at the largest fold changes that are not due to low counts (the right of the x-axis) and uses them to inform a prior distribution. Large fold changes from genes with plenty of statistical information are not shrunk, while imprecise fold changes from genes with low counts are shrunk.
#now shrink the fold changes
res_acute <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Acute_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_early <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Early_recovery_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_late <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Late_recovery_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
Let’s plot the MA plots again to look at the effect of the shrinkage.
Acute vs Control
Early recovery vs Control
Late recovery vs Control
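The three post-shrinkage MA plots above were drawn from the shrunken result objects; a sketch of the calls, using the same y-limits as the earlier MA plot for comparability:

```r
# MA plots after apeglm shrinkage, one per contrast
plotMA(res_acute, ylim = c(-2, 2), main = "Acute vs Control")
plotMA(res_early, ylim = c(-2, 2), main = "Early recovery vs Control")
plotMA(res_late,  ylim = c(-2, 2), main = "Late recovery vs Control")
```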
Now, let’s filter our DEGs using the same criteria as the original authors: FDR < 0.05 and |FC| > 2. Notice that an FC of 2 corresponds to log2FC = 1, since log2(2) = 1.
deg_acute = as.data.frame(res_acute)[which(res_acute$padj < 0.05 & abs(res_acute$log2FoldChange)>1),]
nrow(deg_acute) # 803 degs
## [1] 803
deg_early = as.data.frame(res_early)[which(res_early$padj < 0.05 & abs(res_early$log2FoldChange)>1),]
nrow(deg_early)# 574 degs
## [1] 574
deg_late = as.data.frame(res_late)[which(res_late$padj < 0.05 & abs(res_late$log2FoldChange)>1),]
nrow(deg_late)# 16 degs
## [1] 16
Comparison with the paper: As expected, the number of DEGs we obtain is not the same as the number reported in the paper, since we used different filtering criteria on the counts matrix and we applied shrinkage (the paper does not specify whether the authors did so as well). We obtained more DEGs (n=803) in the Acute vs Control contrast, while the authors reported 339 DEGs for the same contrast. In the second contrast, Early recovery vs Control, we have a slightly higher number of DEGs (n=574) than reported in the paper (n=521). For the last contrast, Late recovery vs Control, we obtained only 16 DEGs; in concordance, the paper’s authors mention they found “hardly any DEG”, although they do not specify the number.
I tried to recreate Figure 6 from the paper, since it is the main figure describing the transcriptomic profile of the monocytes during COVID-19 recovery.
library(ComplexHeatmap)
library(circlize)
First, I recreate the PCA.
#Transform the data for visualization
library(ggplot2)
vsd <- vst(GSE198256_DESeq2_F, blind=FALSE)
pca <- plotPCA(vsd, intgroup=c("Group"), returnData = TRUE)
ggplot(pca, aes(x = PC1, y = PC2, color = Group)) +
geom_point() +
labs(x = "PC1: 41% variance", y = "PC2: 17% variance") +
stat_ellipse()
Comparison with the paper: The distribution of the samples and the clusters in my PCA follow the same pattern as the PCA in the paper; there is only a slight difference in the percentage of variance explained by PC1 and PC2.
Then I recreate the heatmap:
#Subset the expression table to the DEGs in each category.
degslist <- c(rownames(deg_acute), rownames(deg_early), rownames(deg_late)) #1339 genes
#Keep only unique IDs to avoid plotting the same gene multiple times
degslist <- unique(degslist) #1314 genes
#Produce the z-score
avsd <- assay(vsd)
Z <- t(scale(t(avsd)))
#Extract my genes of interest - DEGs
Z_degs <- Z[rownames(Z) %in% degslist,]
ha = HeatmapAnnotation(
df = pDataUSE$Group,
annotation_height = unit(4, "mm")
)
#Set a seed for the k-means clustering, for reproducibility
set.seed(123)
hm <- draw(Heatmap(Z_degs, name = "z-score", km = 5, top_annotation = ha,
show_row_names = FALSE, show_column_names = FALSE, cluster_columns = FALSE))
Interpretation and comparison with the paper: I used the k-means method to produce 5 clusters, as in the paper. The number of elements in my clusters differs from those proposed in the paper, primarily because the DEGs are different; however, my clusters show a similar pattern of expression. To corroborate this, it is necessary to verify whether the representative genes of each cluster are also present in my heatmap and whether they belong to clusters with similar expression patterns (colors). For this, I extracted the IDs of the genes in each cluster using code available at https://github.com/jokergoo/ComplexHeatmap/issues/136, published by the user guidohooiveld.
rcl.list <- row_order(hm) # Extract the row order per cluster (output is a list)
# Loop over the clusters and collect the gene IDs in each one
out <- NULL
for (i in seq_along(rcl.list)) {
  clu <- cbind(rownames(Z_degs)[rcl.list[[i]]], paste0("cluster", i))
  out <- rbind(out, clu)
}
colnames(out) <- c("GeneID", "Cluster")
#check
head(out,n=3)
## GeneID Cluster
## [1,] "2867" "cluster1"
## [2,] "4923" "cluster1"
## [3,] "2184" "cluster1"
Then I exported the list of IDs to a text file, annotated the gene names in BioMart, and compared them with the clusters’ representative genes reported in the paper.
# Export the cluster assignments, to later look up the gene names in BioMart
# and compare with the representative genes per cluster from the paper
#write.table(out, file = "gene_clusters.txt", sep = "\t", quote = FALSE, row.names = FALSE)
#dim(out)
out <- as.data.frame(out)
# Load the annotations obtained from BioMart
annotclust <- read.csv("gene_names_clusters.txt", sep = "\t", header = TRUE)
head(annotclust, n=3)
## GeneID Gene.name
## 1 100124536 SNORA38B
## 2 100128059
## 3 100128164 FHL1P1
#dim(annotclust)
# [1] 1322 2
# More rows than expected (1314), so remove any duplicated GeneIDs
annotclust <- annotclust[!duplicated(annotclust$GeneID), ]
sum(annotclust$GeneID %in% out$GeneID) #1314 ok
## [1] 1314
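As an alternative to the manual BioMart export/import round trip, the same Entrez-ID-to-symbol mapping could be done in-session with Bioconductor annotation packages. A sketch, assuming `org.Hs.eg.db` is installed (not run for this report):

```r
# Sketch: map Entrez GeneIDs to gene symbols programmatically,
# avoiding the manual export to BioMart and re-import
library(org.Hs.eg.db)
symbols <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                 keys      = as.character(out$GeneID),
                                 column    = "SYMBOL",
                                 keytype   = "ENTREZID",
                                 multiVals = "first")
```

This keeps the whole annotation step reproducible inside the R session, although the symbols may differ slightly from the BioMart release used above.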
# Merge the cluster assignments with the BioMart annotations to pair gene ID and gene name
genes_cluster <- merge(annotclust, out, by = "GeneID", all.x = TRUE)
head(genes_cluster, n=3)
## GeneID Gene.name Cluster
## 1 19 ABCA1 cluster4
## 2 133 ADM cluster1
## 3 135 ADORA2A cluster4
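To make the comparison of cluster sizes with the paper concrete, the number of genes assigned to each cluster can be tabulated directly from the merged table:

```r
# Number of genes assigned to each of the 5 k-means clusters
table(genes_cluster$Cluster)
```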
Now that we have the gene names of our genes of interest, let’s create a vector containing the names of the 84 representative genes reported in Figure 6B of Braun et al., 2022. The next step is to subset genes_cluster to extract the genes annotated as representative in the paper.
rep_genes_paper <- c("DHCR24", "ACSL1", "ALOX15B", "HPGD", "MSMO", "LDHA1", "SCDDHFR", "MPO", "S100A12",
"S100A8", "CLU", "HMGB2", "VSIG4", "FCGR1A", "LAIR1", "CD163", "IFI27", "IL1R2",
"FLT3", "SQLE", "SPRY2", "PPARG", "CCL2", "CCL4", "CCL7", "CXCL1", "CXCL16", "CXCL2",
"TNFRSF12A", "CSF1", "IL1R1", "TNFSF14", "TNFSF8", "DDIT3", "EIF2AK3", "FOXO3", "HMGA1",
"JARID2", "IRF2BP2", "IRAK2", "IL1RN", "DOT1", "SOCS6L", "HAVCR2", "MAFF", "MAFB", "IL1B",
"GPR183", "CIITA", "C3", "FTH1", "KLF12", "HLA-DMA", "CLEC10A", "CSF1R", "HLA-DMB",
"MAP3K14", "GPR183", "HLA-DQA1", "CD4", "VEGFA", "HLA-DQA2", "OSMa", "PTGER4", "HLA-DPA1",
"CLEC4A", "CSF1R", "HLA-DOA", "SRCCAMK2D", "KLF11", "HLA-DRA", "NRP1", "PDGFC", "HLA-DRB1",
"NR4A1", "NFE2L3", "ATF3", "JUNB", "NFKB2", "PLAG1", "METTL17", "APAF1", "HIST1H3C", "ZNF490")
# Look for the paper's representative genes in my clusters
my_rep_genes <- subset(genes_cluster, Gene.name %in% rep_genes_paper)
head(my_rep_genes, n =3)
## GeneID Gene.name Cluster
## 12 247 ALOX15B cluster1
## 23 467 ATF3 cluster4
## 41 718 C3 cluster4
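Two quick checks, using the objects defined above, show which of the paper’s representative genes were not recovered in our DEG list and how the recovered ones distribute across our clusters:

```r
# Representative genes from the paper that are absent from our clusters
# (note rep_genes_paper contains a few duplicated names, e.g. GPR183, CSF1R)
missing_genes <- setdiff(rep_genes_paper, genes_cluster$Gene.name)
missing_genes
# Distribution of the recovered representative genes across our clusters
table(my_rep_genes$Cluster)
```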
In our clusters we identified 59 out of the 84 representative genes reported in the paper, and our genes partially share the clustering pattern. For example, in the paper the HLA family of genes all cluster together, which is in agreement with our observation that the HLA genes also belong to the same cluster. Similarly, cluster II from Braun et al. and cluster 5 from our heatmap show a similar pattern of expression and contain multiple genes in common. For more details, refer to the comparative figure below.